Conserving Fuel in Statistical Language Learning: Predicting Data Requirements*
(arXiv:cmp-lg/9509002 v1, 7 Sep 1995)

Author

  • Mark Lauer
Abstract

The paradigm for NLP known as statistical language learning (SLL) has flourished in recent times, being seen as a quick and easy way to get off the ground. Research systems have been launched at many NLP problems, including sense disambiguation (Yarowsky, 1992), anaphora resolution (Dagan and Itai, 1990), prepositional phrase attachment (Hindle and Rooth, 1993) and lexical acquisition (Brent, 1993). This has all been fueled by the large text corpora which are increasingly available (Marcus et al., 1993). Since these systems learn to navigate language by consuming text, they are critically dependent on the data that drives them. In this paper I address the practical concern of predicting how much training data is sufficient for a given system. First, I briefly review earlier results and show how these can be combined to bound the expected accuracy of a mode-based learner as a function of the volume of training data. I then develop a more accurate estimate of the expected accuracy function under the assumption that inputs are uniformly distributed. Since this estimate is expensive to compute, I also give a close but cheaply computable approximation to it. Finally, I report on a series of simulations exploring the effects of inputs that are not uniformly distributed.

1 Background

1.1 Do We Need To Know?

Even though text is becoming increasingly available, it is often expensive, especially if it must be annotated. Consider the decisions facing the SLL technology consumer, that is, the architect of a planned commercial NLP system. For each module which is to employ SLL, an appropriate technique must be selected. If different techniques require different amounts of data to achieve a given accuracy, the architect would like to know what these requirements are in advance in order to make an informed choice. Further, once the technique is chosen, she must decide how much data to collect or purchase for training. Because this data can be expensive, foreknowledge of data requirements is highly valuable. Thus, in order to make statistical NLP technology practical, a predictive theory of data requirements is needed. Despite this need, very little attention has been paid to the problem.[1]

* This paper has been accepted for publication at the Eighth Australian Joint Conference on Artificial Intelligence, Canberra, 1995.
[1] See de Haan (1992) for an investigation of sample sizes for linguistic studies.

1.2 Foundations For A Theory

All the SLL systems mentioned above employ knowledge gained from a corpus to make decisions. Abstractly, this knowledge can be represented as a mapping from observable features (inputs) to decision outcomes (outputs). Following Lauer (1995), I will call each distinguished input a bin and each possible output a value. There is a probability distribution across the bins representing how instances fall into bins. Also, for each bin, there is a probability distribution across the set of values representing how instances in that bin take on values. For the system to perform accurately, most (but not necessarily all) of the instances falling in a particular bin must have the same value.
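As a concrete illustration (a sketch of mine, not from the paper), the bin/value abstraction can be realised as a distribution over values for each bin, estimated from counts. The (verb, noun, preposition) feature triples echo the prepositional phrase attachment task cited above, but the instances themselves are invented:

```python
from collections import Counter, defaultdict

# Toy training instances: each pairs a bin (here a (verb, noun,
# preposition) feature triple, as in PP attachment) with a value
# (the decision outcome). All examples are invented.
instances = [
    (("eat", "pizza", "with"), "verb-attach"),
    (("eat", "pizza", "with"), "verb-attach"),
    (("eat", "pizza", "with"), "noun-attach"),
    (("see", "man", "with"), "noun-attach"),
]

# Estimate Pr(v | b) for each bin from the observed counts.
per_bin = defaultdict(Counter)
for b, v in instances:
    per_bin[b][v] += 1

for b, dist in per_bin.items():
    n = sum(dist.values())
    print(b, {v: c / n for v, c in dist.items()})
```

Note that the first bin is not fully determined: one instance in three disagrees with the majority value, which is exactly the situation the accuracy analysis below must account for.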
In what follows I will make several assumptions:

  • Training and test data are drawn from the same distributions.
  • The set of possible values is binary (examples include Hindle and Rooth, 1993, and Lauer, 1994).
  • The probability of the most likely value in each bin is constant.[2]

Finally, I will only consider a simple learning algorithm: collect the training instances falling into each bin and then select the most frequent value for each bin. This mode-based learner is employed directly in the unigram tagger of Charniak (1993, p. 49) and is at the heart of many systems.

1.3 Optimal Accuracy

There are two sources of error in statistical language learners of the kind we are considering. First, since the values are not necessarily fully determined by the bins, no matter what value the learner assigns to a bin there will always be errors (the optimal error rate). Second, since training data is limited, the learner may not have sufficient data available to acquire accurate rules. The combination of these sources of error results in some degree of inaccuracy for the system. We are interested in estimating the accuracy for various volumes of training data. Since the optimal error rate is independent of the amount of training data, it will persist no matter how much data is used; as the amount of training data increases, we expect the accuracy to approach this optimum.

Let $B$ be the set of bins, $V$ the set of values, $\Pr(b)$ the probability that an instance falls into the bin $b$, and $\Pr(v \mid b)$ the probability of the value $v$ given the bin $b$. If we denote the most likely value in each bin as $v_b = \arg\max_{v \in V} \Pr(v \mid b)$, then the expected value of the optimal accuracy is determined by the likelihood of this value occurring in each bin:

$$OA = \sum_{b \in B} \Pr(b) \Pr(v_b \mid b) \qquad (1)$$

If we know the probability that an algorithm will learn the value $v$ for the bin $b$ (denote this $\Pr(\mathrm{learn}(b) = v)$), then we can also calculate the expected accuracy rate:

$$EA = \sum_{b \in B} \Pr(b) \sum_{v \in V} \Pr(\mathrm{learn}(b) = v) \Pr(v \mid b) \qquad (2)$$
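To make these quantities concrete, here is a minimal sketch (mine, not the paper's) of the mode-based learner, OA from equation (1), and a Monte Carlo estimate of EA from equation (2). The ten equiprobable bins, the constant $\Pr(v_b \mid b) = 0.8$, and the coin-flip treatment of bins unseen in training are all illustrative assumptions of mine, not choices made by the paper:

```python
import random
from collections import Counter, defaultdict

def train_mode_learner(instances):
    """Mode-based learner: for each bin, pick the most frequent value
    among the training instances that fell into that bin."""
    counts = defaultdict(Counter)
    for b, v in instances:
        counts[b][v] += 1
    return {b: dist.most_common(1)[0][0] for b, dist in counts.items()}

def optimal_accuracy(bin_probs, p):
    # Equation (1) with Pr(v_b | b) = p held constant across bins,
    # so OA = sum_b Pr(b) * p = p.
    return sum(pb * p for pb in bin_probs.values())

def expected_accuracy_mc(bin_probs, p, m, trials=2000, seed=0):
    """Monte Carlo estimate of EA (equation 2) for m training instances."""
    rng = random.Random(seed)
    bins, weights = zip(*bin_probs.items())
    total = 0.0
    for _ in range(trials):
        # Draw a training set: bins by Pr(b), values with Pr(v1 | b) = p.
        sample = [(b, "v1" if rng.random() < p else "v2")
                  for b in rng.choices(bins, weights=weights, k=m)]
        learned = train_mode_learner(sample)
        # Accuracy of the learned rules on the true distribution.
        # Unseen bins get a coin-flip guess (0.5) -- my assumption,
        # not something the paper specifies.
        acc = 0.0
        for b, pb in bin_probs.items():
            if b not in learned:
                acc += pb * 0.5
            elif learned[b] == "v1":
                acc += pb * p
            else:
                acc += pb * (1 - p)
        total += acc
    return total / trials

# Ten equiprobable bins with Pr(v_b | b) = 0.8 (invented numbers).
bin_probs = {f"bin{i}": 0.1 for i in range(10)}
print("OA =", optimal_accuracy(bin_probs, 0.8))
for m in (10, 50, 200):
    print(f"m = {m}: EA ~ {expected_accuracy_mc(bin_probs, 0.8, m):.3f}")
```

With these invented numbers, EA climbs toward OA = 0.8 as m grows, which is precisely the accuracy-versus-training-volume curve the paper sets out to predict analytically.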
